Case Study

Reverse Engineering Airbnb's Search Ranking Algorithm

Building a Python web scraping pipeline to analyse 1,600+ listings and uncover the factors that drive search visibility on Airbnb.

Market: Cape Town, South Africa
Type: Personal Research
Completed: 2017

Project Overview

As a part-time property manager with a portfolio of rental properties in Cape Town, I wanted to understand what factors influence search rankings on Airbnb. Higher rankings mean more visibility, more bookings, and ultimately more revenue.

Airbnb doesn't publish its ranking algorithm, so I built a data pipeline to scrape listing data at scale, store it in a relational database, and run a statistical correlation analysis to reverse-engineer the factors that matter most.

1,638 Listings Analysed
27 Factors Tested
34 Amenities Ranked
17 Search Pages Scraped

Python, Scrapy, SQL, QlikView

The Data Pipeline

Since Airbnb doesn't provide a public API, I built a custom web scraping solution to gather the data needed for analysis:

🕷️ Scrape: Python Scrapy spider
🔄 Parse: Extract JSON from HTML
🗄️ Store: Relational database
📊 Analyse: Correlation statistics

Target Data Points

Each Airbnb listing displays surface-level information like price and reviews, but much richer data is hidden in the page's underlying JSON structure. I identified the key fields to extract:

Data extracted from each listing
Airbnb listing showing extracted data points: price, reviews, name, beds, guests

Building the Scrapy Spider

I built a Python web spider using the Scrapy framework to systematically crawl Airbnb's search results. The spider was configured to target listings accommodating 6+ guests within specific GPS coordinates covering greater Cape Town.

airbnb_spider.py
Python Scrapy spider code showing URL construction with GPS coordinates and pagination

The spider constructs URLs dynamically, appending search parameters (guest count, GPS boundaries) and iterating through all result pages automatically.
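
The URL construction described above can be sketched roughly as follows. This is a minimal illustration rather than the original spider code: the base URL, the query parameter names (guests, sw_lat, offset and so on), the page size and the bounding-box coordinates are all hypothetical placeholders, not Airbnb's real endpoint or parameters.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real search URL is not given in this write-up.
BASE_URL = "https://example.com/s/Cape-Town/homes"


def build_search_urls(min_guests, sw, ne, pages, page_size=18):
    """Return one search URL per result page for a GPS bounding box.

    sw and ne are (latitude, longitude) tuples for the south-west and
    north-east corners. Parameter names are assumptions for illustration.
    """
    urls = []
    for page in range(pages):
        params = {
            "guests": min_guests,
            "sw_lat": sw[0],
            "sw_lng": sw[1],
            "ne_lat": ne[0],
            "ne_lng": ne[1],
            "offset": page * page_size,  # pagination by result offset
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


# Rough, illustrative bounding box; not the coordinates used in the study.
urls = build_search_urls(6, (-34.4, 18.3), (-33.7, 18.9), pages=17)
```

In a Scrapy spider, a list like this could be assigned to the spider's start_urls attribute so that each page is requested in turn.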

Parsing Hidden JSON Data

The real value came from extracting data embedded in each listing's HTML as JSON. This included fields not visible on the surface - satisfaction scores, response rates, calendar update frequency, and 30+ other variables.

parse_listing.py
Python code parsing JSON data from Airbnb listing pages

Technical Note

The JSON was embedded in a script tag within the HTML. Using XPath to extract the content and Python's json library to parse it gave access to far more data than the visible page elements.
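
As a rough sketch of the idea in the note above: the project used XPath to pull out the script content, but to stay dependency-free this example swaps in Python's built-in html.parser instead. The script tag's type attribute, the sample HTML and the field names are assumptions made for illustration; the real page structure is not shown here.

```python
import json
from html.parser import HTMLParser


class JsonScriptExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._chunks = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._chunks = []

    def handle_data(self, data):
        # Script text can arrive in several chunks, so buffer it.
        if self._in_json_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self.payloads.append(json.loads("".join(self._chunks)))
            self._in_json_script = False


# Hypothetical page fragment; field names are invented for illustration.
SAMPLE_HTML = """
<html><body>
<h1>Sunny villa</h1>
<script type="application/json">
{"listing": {"id": 1, "guest_satisfaction": 95, "min_nights": 2}}
</script>
</body></html>
"""


def extract_listing(html):
    """Return the 'listing' object from the first embedded JSON payload."""
    parser = JsonScriptExtractor()
    parser.feed(html)
    parser.close()
    return parser.payloads[0]["listing"]


listing = extract_listing(SAMPLE_HTML)
```

With Scrapy's selectors, the same step would be an XPath query for the script tag's text followed by json.loads on the result.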

Database Architecture

The scraped data was normalised into a relational structure with five interconnected tables, enabling flexible analysis across multiple dimensions:

Entity Relationship Diagram
Database ERD showing relationships between Scrapy Data, io Data, Amenities, Suburb, and Gender tables

The central Scrapy Data table contained the detailed listing information, joined to amenity codes (with a separate lookup table for descriptions), suburb data for location analysis, and a gender classification table to test whether host gender correlated with rankings.
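
To illustrate the kind of structure described above, here is a small sketch using Python's built-in sqlite3 module. It does not reproduce the actual ERD: every table and column name is hypothetical, the database engine used in the project is not stated, and the sample rows are invented. Only the overall shape (a central listing table joined to amenity, suburb and gender lookups) follows the description.

```python
import sqlite3

# All table and column names are assumptions made for illustration.
SCHEMA = """
CREATE TABLE suburb (suburb_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE gender (gender_id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE amenity (amenity_id INTEGER PRIMARY KEY, description TEXT NOT NULL);
CREATE TABLE listing (
    listing_id INTEGER PRIMARY KEY,
    search_page INTEGER NOT NULL,
    guest_satisfaction REAL,
    host_gender_id INTEGER REFERENCES gender(gender_id),
    suburb_id INTEGER REFERENCES suburb(suburb_id)
);
-- One row per (listing, amenity code) pair.
CREATE TABLE listing_amenity (
    listing_id INTEGER REFERENCES listing(listing_id),
    amenity_id INTEGER REFERENCES amenity(amenity_id),
    PRIMARY KEY (listing_id, amenity_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Invented sample rows, purely to show the joins working.
conn.execute("INSERT INTO suburb VALUES (1, 'Camps Bay')")
conn.execute("INSERT INTO amenity VALUES (10, 'Iron')")
conn.execute("INSERT INTO listing VALUES (100, 1, 90.0, NULL, 1)")
conn.execute("INSERT INTO listing_amenity VALUES (100, 10)")

# Resolve suburb names and amenity descriptions for each listing.
rows = conn.execute("""
    SELECT l.listing_id, s.name, a.description
    FROM listing AS l
    JOIN suburb AS s ON s.suburb_id = l.suburb_id
    JOIN listing_amenity AS la ON la.listing_id = l.listing_id
    JOIN amenity AS a ON a.amenity_id = la.amenity_id
""").fetchall()
```

Keeping amenity descriptions in a lookup table means each listing only stores amenity codes, which makes per-amenity counts a simple join and group.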

Key Findings

Guest Satisfaction: The Dominant Factor

The most striking finding was the clear linear relationship between guest satisfaction scores and search ranking. Page 1 listings averaged 83.7% satisfaction, declining steadily to 42.9% on page 17.

Average Guest Satisfaction by Search Result Page
Table showing guest satisfaction declining from 83.7% on page 1 to 42.9% on page 17

Clear inverse correlation: as page number increases, average guest satisfaction decreases.
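
The calculation behind a chart like this can be sketched as below: average a factor per search page, then correlate those averages with the page number. The write-up does not say which coefficient was used, so the Pearson formula here is an assumption, and the input rows are made-up values for illustration only, not study data.

```python
from collections import defaultdict
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def page_averages(rows):
    """Map each search page to the mean factor value of its listings."""
    by_page = defaultdict(list)
    for page, value in rows:
        by_page[page].append(value)
    return {page: mean(vals) for page, vals in sorted(by_page.items())}


# Made-up (page, satisfaction) pairs; NOT results from the study.
rows = [(1, 90), (1, 80), (2, 80), (2, 60), (3, 50), (3, 50)]
averages = page_averages(rows)
r = pearson(list(averages), list(averages.values()))
```

A coefficient close to -1 here means the factor falls as the page number rises, which is the pattern described for guest satisfaction.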

Top 5 Ranking Factors

Statistical correlation analysis revealed the five factors most strongly associated with higher search rankings:

Factors 1-5: Strongest Correlation with Search Rank
Data table showing Guest Satisfaction, Price, Word Count, Minimum Nights, and Days since Calendar Update by page

1 Guest Satisfaction (0.906)

The composite score from guest reviews shows the strongest correlation with search position, narrowly ahead of price. This is consistent with Airbnb prioritising properties that deliver great experiences.

2 Price (0.901)

Lower prices correlate with better rankings, which suggests that Airbnb favours listings offering good value and rewards competitive pricing.

3 Description Word Count (0.897)

Listings with detailed descriptions rank higher. This could be algorithmic, or it may indicate that thorough hosts perform better overall.

4 Minimum Stay Length (0.885)

Shorter minimum stays correlate with higher rankings - flexibility is rewarded.

5 Calendar Activity (0.884)

Hosts who update their calendar frequently rank higher, which suggests that Airbnb rewards active engagement with the platform.

Factors 6-10

Factors 6-10: Secondary Correlation Factors
Data table showing Price/Bed, Name Length, InstantBook %, Reviews, and Wishlist saves by page

Notable findings from the secondary factors include the importance of Instant Book (0.844 correlation), review count (0.828), and wishlist saves (0.819).

The Complete Correlation Table

All 27 factors tested, ranked by their correlation coefficient with search page position:

All Factors Ranked by Correlation to Search Position
Complete ranking table showing all 27 factors and their correlation coefficients

Green = strong correlation with ranking. Red = weak or no correlation. Coefficients are shown as absolute values for easier comparison, so the table indicates the strength of each relationship, not its direction.
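
As a small sketch of that last step, the snippet below ranks factors by the absolute value of their coefficient, so that strongly negative and strongly positive relationships sort together. The function name and the signed values are invented for illustration and are not results from the study.

```python
def rank_by_strength(coefficients):
    """Sort factors by |r|, strongest first, returning (name, |r|) pairs."""
    return sorted(
        ((name, abs(r)) for name, r in coefficients.items()),
        key=lambda item: item[1],
        reverse=True,
    )


# Invented signed coefficients; NOT values from the study.
signed = {"factor_a": -0.9, "factor_b": 0.4, "factor_c": 0.7}
ranked = rank_by_strength(signed)
```

Taking absolute values makes the ranking easy to read, at the cost of hiding direction, so the sign of each coefficient still matters when interpreting an individual factor.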

Surprising Non-Factors

Superhost status ranked only 13th (0.783), lower than expected. Account age, allowing smoking, pet-friendly policies, and having "view" in the title showed almost no correlation with ranking.

Detecting Algorithm Changes Over Time

By comparing data from February 2016 and February 2017, I was able to detect a significant shift in how Airbnb treats Instant Book listings:

Instant Book Correlation: 2016 vs 2017
Side-by-side comparison showing Instant Book went from 21.1% average in 2016 to 27.1% in 2017 with clear correlation to page rank

February 2016

0.14
Correlation coefficient
(No meaningful relationship)

February 2017

0.84
Correlation coefficient
(Strong positive correlation)

This aligned with Airbnb's September 2016 announcement that they would "accelerate the use of Instant Book" to reach one million listings by January 2017, partly as a measure to reduce discrimination in the booking process.

Amenity Correlation Analysis

I also analysed which amenities correlated most strongly with higher search rankings:

Amenities Ranked by Correlation to Search Position
Table showing 34 amenities ranked by correlation, with Iron, Laptop Workspace, and Hangers at the top

Red highlighted rows indicate "Business Ready" required amenities - notably these cluster in the top half of correlations.

Interestingly, "Business Ready" required amenities (iron, laptop workspace, hangers, hair dryer, essentials) all showed strong correlation with better rankings. Meanwhile, TV and Cable TV showed almost no correlation - perhaps less important for properties hosting 6+ guests.

Actionable Recommendations

Based on the correlation analysis, here are the most impactful actions a host can take to improve search ranking:

#  Action                           Why It Works
1  Keep your calendar updated       Shows active engagement; Airbnb rewards responsive hosts
2  Ask guests to complete reviews   Guest satisfaction is the #1 ranking factor
3  Price competitively              Airbnb favours listings that offer value
4  Lower minimum stay requirements  Flexibility correlates with higher rankings
5  Enable Instant Book              Algorithm now strongly favours instant bookability
6  Respond quickly to requests      Response rate and time both correlate with ranking
7  Add business-ready amenities     Iron, desk, hangers, hair dryer, essentials all help

Project Impact

This research project delivered practical insights that I applied directly to the properties I manage, and it laid the groundwork for larger-scale research collaborations with professional property management companies.

What Came Next

This methodology was later expanded into a multi-market study with Angel Host, analysing how ranking factors vary by geographic location - proving that a one-size-fits-all optimisation strategy doesn't work.

The project demonstrated that with the right technical approach - web scraping, data engineering, and statistical analysis - it's possible to reverse-engineer opaque algorithms and extract actionable business intelligence.
